Improving effectiveness of mutual information for substantival multiword expression extraction

Authors

  • Wen Zhang
  • Taketoshi Yoshida
  • Xijin Tang
  • Tu Bao Ho
Abstract

One deficiency of mutual information is its poor capacity to measure the association of words with asymmetric co-occurrence, which accounts for a large share of multiword expressions in text. Moreover, threshold setting, which is decisive for the practical success of mutual information in multiword extraction, requires many parameters to be predefined manually when extracting multiword expressions with different numbers of individual words. In this paper, we propose a new method, EMICO (Enhanced Mutual Information and Collocation Optimization), to extract substantival multiword expressions from text. Specifically, enhanced mutual information is proposed to measure the association of words, and collocation optimization is proposed to automatically determine the number of individual words contained in a multiword expression when the multiword expression occurs in a candidate set. Our experiments showed that EMICO significantly improves the performance of substantival multiword expression extraction in comparison with a classic extraction method based on mutual information. © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.02.026
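For orientation, the classic mutual-information baseline that the abstract compares against can be sketched as pointwise mutual information over adjacent word pairs. This is only an illustrative sketch of the standard measure, not the EMICO method, whose formulas are not given here. The function names `bigram_pmi` and `conditional_probs` are hypothetical, and estimating probabilities from raw counts over a flat token list is an assumption.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Probabilities are simple relative frequencies (an assumption of
    this sketch); no smoothing or frequency cut-off is applied.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def conditional_probs(tokens, w1, w2):
    """Return (share of w1 occurrences followed by w2,
               share of w2 occurrences preceded by w1).

    A large gap between the two values is one way to see the
    asymmetric co-occurrence that the abstract refers to.
    """
    unigrams = Counter(tokens)
    pair = Counter(zip(tokens, tokens[1:]))[(w1, w2)]
    return pair / unigrams[w1], pair / unigrams[w2]


if __name__ == "__main__":
    text = ("mutual information measures word association and mutual "
            "information needs a threshold while information retrieval "
            "uses information gain")
    toks = text.split()
    print(bigram_pmi(toks)[("mutual", "information")])
    print(conditional_probs(toks, "mutual", "information"))
```

In a threshold-based extractor, pairs whose score exceeds a manually chosen cut-off would be kept as candidates; the need to pick such cut-offs by hand for each expression length is the parameter burden the abstract describes.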

Similar resources

Improving LNMF Performance of Facial Expression Recognition via Significant Parts Extraction using Shapley Value

Nonnegative Matrix Factorization (NMF) algorithms have been utilized in a wide range of real applications. NMF is used by several researchers because of its part-based representation property, especially in the facial expression recognition problem. It decomposes a face image into its essential parts (e.g. nose, lips, etc.), but all previous attempts neglected that all features achieved by NMF ...

Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units

The availability of contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps attachment decisions, improves indexing in information retrieval (IR) systems, reinforces information extraction (IE) and text mining, among other applications. Unfortunately, their acquisition has long been a significant problem in NLP, ...

Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora

Multiword units are groups of words that occur together more often than expected by chance in sub-languages. Président de la République, Coupe du monde and Traité de Maastricht are multiword units. Unfortunately, most of the machine-readable dictionaries contain clearly insufficient information about multiword units. Therefore, their automatic extraction from corpora is an important issue not o...

Combining Linguistics with statistics for multiword term extraction: a fruitful association?

The acquisition of multiword terms from large text collections is a fundamental issue in the context of Information Retrieval. Indeed, their identification leads to improvements in the indexing process and allows guiding the user in his search for information. In this paper, we present an original methodology that allows extracting multiword terms by either (1) exclusively considering statistic...

Syntax and Semantics vs. Statistics for Italian Multiword Expressions: Empirical Prototypes and Extraction Strategies

In this work we present an empirical analysis performed on Italian nominal multiword expressions (MWEs) of the form [noun + adjective] that aims at studying quantitatively their syntactic and semantic features in order to improve their automatic identification and collection. Three indices are proposed, which are able to measure syntactic and semantic frozeness of the expressions on empirical b...

Journal title:
  • Expert Syst. Appl.

Volume 36  Issue 

Pages  -

Publication date 2009